sparkLoading External Data
For this project, we chose to incorproate two additional datasets. We hope these external datasets can contextualize some of the results we see in the reddit comments.
The first external dataset is the weekly rankings of collegiate teams. We hope this provides some context from impartial observers who follow collegiate basketball and can provide some context as to how good – or bad – teams are compared to other national teams.
The second external dataset is the game-by-game results from the two teams we chose to focus on - UNC and Kansas. This will help provide us more information on how these teams are doing week in and week out.
Below, we load that data here.
import pandas as pd
import requests
import matplotlib as plt
import plotly.express as pxurl = 'https://www.sports-reference.com/cbb/seasons/men/2022-polls.html'
rankings = pd.read_html(url, header=2)[0]
rankings = rankings.drop('Conf', axis=1)
rankings_transpose = rankings.T
rankings_transpose = rankings_transpose.reset_index(drop=True)
new_header = rankings_transpose.iloc[0] #grab the first row for the header
rankings_transpose = rankings_transpose[1:] #take the data less the header row
rankings_transpose.columns = new_header
for column in rankings_transpose.columns:
rankings_transpose[column] = pd.to_numeric(rankings_transpose[column], errors='coerce')
rankings_transpose.dtypes
rankings_transpose = rankings_transpose.reset_index(drop=True)
rankings_transpose = rankings_transpose.drop(rankings_transpose.columns[20], axis=1)
rankings_transpose| Alabama | Arizona | Arkansas | Auburn | BYU | Baylor | Boise State | Colorado State | Davidson | Duke | ... | Texas Tech | UCLA | UConn | UNC | USC | Villanova | Virginia | Wisconsin | Wyoming | Xavier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | NaN | 16.0 | 22 | NaN | 8 | NaN | NaN | NaN | 9 | ... | NaN | 2 | 24.0 | 19.0 | NaN | 4 | 25.0 | NaN | NaN | NaN |
| 1 | 14.0 | NaN | 16.0 | 21 | NaN | 9 | NaN | NaN | NaN | 7 | ... | NaN | 2 | 23.0 | 18.0 | 25.0 | 5 | NaN | NaN | NaN | NaN |
| 2 | 10.0 | 17.0 | 13.0 | 19 | 18.0 | 6 | NaN | NaN | NaN | 5 | ... | NaN | 2 | 22.0 | NaN | 24.0 | 7 | NaN | NaN | NaN | 25.0 |
| 3 | 16.0 | 11.0 | 10.0 | 21 | 12.0 | 4 | NaN | NaN | NaN | 1 | ... | NaN | 5 | 17.0 | NaN | 20.0 | 6 | NaN | 23.0 | NaN | NaN |
| 4 | 9.0 | 11.0 | 12.0 | 18 | 24.0 | 2 | NaN | NaN | NaN | 3 | ... | NaN | 4 | 15.0 | NaN | 16.0 | 6 | NaN | 22.0 | NaN | NaN |
| 5 | 6.0 | 8.0 | 24.0 | 13 | NaN | 1 | NaN | 23.0 | NaN | 2 | ... | 25.0 | 4 | 20.0 | NaN | 10.0 | 9 | NaN | NaN | NaN | 22.0 |
| 6 | 10.0 | 6.0 | NaN | 12 | NaN | 1 | NaN | 21.0 | NaN | 2 | ... | 25.0 | 5 | NaN | NaN | 8.0 | 23 | NaN | 24.0 | NaN | 18.0 |
| 7 | 19.0 | 9.0 | NaN | 11 | NaN | 1 | NaN | 20.0 | NaN | 2 | ... | 25.0 | 5 | NaN | NaN | 7.0 | 22 | NaN | 24.0 | NaN | 23.0 |
| 8 | 15.0 | 8.0 | NaN | 9 | NaN | 1 | NaN | 20.0 | NaN | 2 | ... | 25.0 | 5 | NaN | NaN | 7.0 | 19 | NaN | 23.0 | NaN | 22.0 |
| 9 | 24.0 | 6.0 | NaN | 4 | NaN | 1 | NaN | NaN | NaN | 8 | ... | 19.0 | 3 | NaN | NaN | 5.0 | 14 | NaN | 13.0 | NaN | 17.0 |
| 10 | NaN | 3.0 | NaN | 2 | NaN | 5 | NaN | NaN | NaN | 6 | ... | 18.0 | 9 | 25.0 | NaN | 16.0 | 11 | NaN | 8.0 | NaN | 20.0 |
| 11 | NaN | 3.0 | NaN | 1 | NaN | 4 | NaN | NaN | 25.0 | 9 | ... | 13.0 | 7 | 20.0 | NaN | 15.0 | 14 | NaN | 11.0 | NaN | 21.0 |
| 12 | NaN | 7.0 | NaN | 1 | NaN | 8 | NaN | NaN | NaN | 9 | ... | 14.0 | 3 | 17.0 | NaN | 19.0 | 12 | NaN | 11.0 | NaN | 21.0 |
| 13 | NaN | 4.0 | NaN | 1 | NaN | 10 | NaN | NaN | NaN | 7 | ... | 9.0 | 12 | 24.0 | NaN | 21.0 | 15 | NaN | 14.0 | NaN | 25.0 |
| 14 | 25.0 | 3.0 | 23.0 | 2 | NaN | 7 | NaN | NaN | NaN | 9 | ... | 11.0 | 13 | 24.0 | NaN | 17.0 | 10 | NaN | 15.0 | 22.0 | NaN |
| 15 | 24.0 | 2.0 | 18.0 | 3 | NaN | 10 | NaN | NaN | NaN | 7 | ... | 9.0 | 12 | 21.0 | NaN | 16.0 | 8 | NaN | 13.0 | NaN | NaN |
| 16 | 25.0 | 2.0 | 14.0 | 5 | NaN | 3 | NaN | NaN | NaN | 4 | ... | 12.0 | 17 | 18.0 | NaN | 16.0 | 11 | NaN | 10.0 | NaN | NaN |
| 17 | NaN | 2.0 | 15.0 | 4 | NaN | 3 | NaN | 23.0 | NaN | 7 | ... | 14.0 | 13 | 20.0 | 25.0 | 21.0 | 8 | NaN | 12.0 | NaN | NaN |
| 18 | NaN | 2.0 | 17.0 | 8 | NaN | 4 | 23.0 | 24.0 | NaN | 9 | ... | 12.0 | 11 | 21.0 | NaN | 22.0 | 6 | NaN | 14.0 | NaN | NaN |
19 rows × 47 columns
url_kansas = 'https://www.sports-reference.com/cbb/schools/kansas/men/2022-schedule.html'
kansas_schedule = pd.read_html(url_kansas, header=0)[1]
kansas_schedule = kansas_schedule[~kansas_schedule['Date'].str.contains('date', case=False, na=False)]
columns_keep = ['Date', 'Opponent', 'Unnamed: 8', 'Tm', 'Opp', 'OT', 'W', 'L', 'Streak']
kansas_schedule = kansas_schedule[columns_keep]
kansas_schedule| Date | Opponent | Unnamed: 8 | Tm | Opp | OT | W | L | Streak | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue, Nov 9, 2021 | Michigan State | W | 87 | 74 | NaN | 1 | 0 | W 1 |
| 1 | Fri, Nov 12, 2021 | Tarleton State | W | 88 | 62 | NaN | 2 | 0 | W 2 |
| 2 | Thu, Nov 18, 2021 | Stony Brook | W | 88 | 59 | NaN | 3 | 0 | W 3 |
| 3 | Thu, Nov 25, 2021 | North Texas | W | 71 | 59 | NaN | 4 | 0 | W 4 |
| 4 | Fri, Nov 26, 2021 | Dayton | L | 73 | 74 | NaN | 4 | 1 | L 1 |
| 5 | Sun, Nov 28, 2021 | Iona | W | 96 | 83 | NaN | 5 | 1 | W 1 |
| 6 | Fri, Dec 3, 2021 | St. John's (NY) | W | 95 | 75 | NaN | 6 | 1 | W 2 |
| 7 | Tue, Dec 7, 2021 | UTEP | W | 78 | 52 | NaN | 7 | 1 | W 3 |
| 8 | Sat, Dec 11, 2021 | Missouri | W | 102 | 65 | NaN | 8 | 1 | W 4 |
| 9 | Sat, Dec 18, 2021 | Stephen F. Austin | W | 80 | 72 | NaN | 9 | 1 | W 5 |
| 10 | Wed, Dec 29, 2021 | Nevada | W | 88 | 61 | NaN | 10 | 1 | W 6 |
| 11 | Sat, Jan 1, 2022 | George Mason | W | 76 | 67 | NaN | 11 | 1 | W 7 |
| 12 | Tue, Jan 4, 2022 | Oklahoma State | W | 74 | 63 | NaN | 12 | 1 | W 8 |
| 13 | Sat, Jan 8, 2022 | Texas Tech (25) | L | 67 | 75 | NaN | 12 | 2 | L 1 |
| 14 | Tue, Jan 11, 2022 | Iowa State (15) | W | 62 | 61 | NaN | 13 | 2 | W 1 |
| 15 | Sat, Jan 15, 2022 | West Virginia | W | 85 | 59 | NaN | 14 | 2 | W 2 |
| 16 | Tue, Jan 18, 2022 | Oklahoma | W | 67 | 64 | NaN | 15 | 2 | W 3 |
| 17 | Sat, Jan 22, 2022 | Kansas State | W | 78 | 75 | NaN | 16 | 2 | W 4 |
| 18 | Mon, Jan 24, 2022 | Texas Tech (13) | W | 94 | 91 | 2OT | 17 | 2 | W 5 |
| 19 | Sat, Jan 29, 2022 | Kentucky (12) | L | 62 | 80 | NaN | 17 | 3 | L 1 |
| 20 | Tue, Feb 1, 2022 | Iowa State (20) | W | 70 | 61 | NaN | 18 | 3 | W 1 |
| 21 | Sat, Feb 5, 2022 | Baylor (8) | W | 83 | 59 | NaN | 19 | 3 | W 2 |
| 22 | Mon, Feb 7, 2022 | Texas (20) | L | 76 | 79 | NaN | 19 | 4 | L 1 |
| 23 | Sat, Feb 12, 2022 | Oklahoma | W | 71 | 69 | NaN | 20 | 4 | W 1 |
| 24 | Mon, Feb 14, 2022 | Oklahoma State | W | 76 | 62 | NaN | 21 | 4 | W 2 |
| 25 | Sat, Feb 19, 2022 | West Virginia | W | 71 | 58 | NaN | 22 | 4 | W 3 |
| 26 | Tue, Feb 22, 2022 | Kansas State | W | 102 | 83 | NaN | 23 | 4 | W 4 |
| 27 | Sat, Feb 26, 2022 | Baylor (10) | L | 70 | 80 | NaN | 23 | 5 | L 1 |
| 28 | Tue, Mar 1, 2022 | TCU | L | 64 | 74 | NaN | 23 | 6 | L 2 |
| 29 | Thu, Mar 3, 2022 | TCU | W | 72 | 68 | NaN | 24 | 6 | W 1 |
| 30 | Sat, Mar 5, 2022 | Texas (21) | W | 70 | 63 | OT | 25 | 6 | W 2 |
| 32 | Thu, Mar 10, 2022 | West Virginia | W | 87 | 63 | NaN | 26 | 6 | W 3 |
| 33 | Fri, Mar 11, 2022 | TCU | W | 75 | 62 | NaN | 27 | 6 | W 4 |
| 34 | Sat, Mar 12, 2022 | Texas Tech (14) | W | 74 | 65 | NaN | 28 | 6 | W 5 |
| 36 | Thu, Mar 17, 2022 | Texas Southern | W | 83 | 56 | NaN | 29 | 6 | W 6 |
| 37 | Sat, Mar 19, 2022 | Creighton | W | 79 | 72 | NaN | 30 | 6 | W 7 |
| 38 | Fri, Mar 25, 2022 | Providence (13) | W | 66 | 61 | NaN | 31 | 6 | W 8 |
| 39 | Sun, Mar 27, 2022 | Miami (FL) | W | 76 | 50 | NaN | 32 | 6 | W 9 |
| 40 | Sat, Apr 2, 2022 | Villanova (6) | W | 81 | 65 | NaN | 33 | 6 | W 10 |
| 41 | Mon, Apr 4, 2022 | North Carolina | W | 72 | 69 | NaN | 34 | 6 | W 11 |
unc_url = 'https://www.sports-reference.com/cbb/schools/north-carolina/men/2022-schedule.html'
unc_schedule = pd.read_html(unc_url, header=0)[1]
unc_schedule = unc_schedule[~unc_schedule['Date'].str.contains('date', case=False, na=False)]
unc_schedule = unc_schedule[columns_keep]
unc_schedule| Date | Opponent | Unnamed: 8 | Tm | Opp | OT | W | L | Streak | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue, Nov 9, 2021 | Loyola (MD) | W | 83 | 67 | NaN | 1 | 0 | W 1 |
| 1 | Fri, Nov 12, 2021 | Brown | W | 94 | 87 | NaN | 2 | 0 | W 2 |
| 2 | Tue, Nov 16, 2021 | College of Charleston | W | 94 | 83 | NaN | 3 | 0 | W 3 |
| 3 | Sat, Nov 20, 2021 | Purdue (6) | L | 84 | 93 | NaN | 3 | 1 | L 1 |
| 4 | Sun, Nov 21, 2021 | Tennessee (17) | L | 72 | 89 | NaN | 3 | 2 | L 2 |
| 5 | Tue, Nov 23, 2021 | UNC Asheville | W | 72 | 53 | NaN | 4 | 2 | W 1 |
| 6 | Wed, Dec 1, 2021 | Michigan (24) | W | 72 | 51 | NaN | 5 | 2 | W 2 |
| 7 | Sun, Dec 5, 2021 | Georgia Tech | W | 79 | 62 | NaN | 6 | 2 | W 3 |
| 8 | Sat, Dec 11, 2021 | Elon | W | 80 | 63 | NaN | 7 | 2 | W 4 |
| 9 | Tue, Dec 14, 2021 | Furman | W | 74 | 61 | NaN | 8 | 2 | W 5 |
| 10 | Sat, Dec 18, 2021 | Kentucky (21) | L | 69 | 98 | NaN | 8 | 3 | L 1 |
| 11 | Tue, Dec 21, 2021 | Appalachian State | W | 70 | 50 | NaN | 9 | 3 | W 1 |
| 12 | Sun, Jan 2, 2022 | Boston College | W | 91 | 65 | NaN | 10 | 3 | W 2 |
| 13 | Wed, Jan 5, 2022 | Notre Dame | L | 73 | 78 | NaN | 10 | 4 | L 1 |
| 14 | Sat, Jan 8, 2022 | Virginia | W | 74 | 58 | NaN | 11 | 4 | W 1 |
| 15 | Sat, Jan 15, 2022 | Georgia Tech | W | 88 | 65 | NaN | 12 | 4 | W 2 |
| 16 | Tue, Jan 18, 2022 | Miami (FL) | L | 57 | 85 | NaN | 12 | 5 | L 1 |
| 17 | Sat, Jan 22, 2022 | Wake Forest | L | 76 | 98 | NaN | 12 | 6 | L 2 |
| 18 | Mon, Jan 24, 2022 | Virginia Tech | W | 78 | 68 | NaN | 13 | 6 | W 1 |
| 19 | Wed, Jan 26, 2022 | Boston College | W | 58 | 47 | NaN | 14 | 6 | W 2 |
| 20 | Sat, Jan 29, 2022 | NC State | W | 100 | 80 | NaN | 15 | 6 | W 3 |
| 21 | Tue, Feb 1, 2022 | Louisville | W | 90 | 83 | OT | 16 | 6 | W 4 |
| 22 | Sat, Feb 5, 2022 | Duke (9) | L | 67 | 87 | NaN | 16 | 7 | L 1 |
| 23 | Tue, Feb 8, 2022 | Clemson | W | 79 | 77 | NaN | 17 | 7 | W 1 |
| 24 | Sat, Feb 12, 2022 | Florida State | W | 94 | 74 | NaN | 18 | 7 | W 2 |
| 25 | Wed, Feb 16, 2022 | Pittsburgh | L | 67 | 76 | NaN | 18 | 8 | L 1 |
| 26 | Sat, Feb 19, 2022 | Virginia Tech | W | 65 | 57 | NaN | 19 | 8 | W 1 |
| 27 | Mon, Feb 21, 2022 | Louisville | W | 70 | 63 | NaN | 20 | 8 | W 2 |
| 28 | Sat, Feb 26, 2022 | NC State | W | 84 | 74 | NaN | 21 | 8 | W 3 |
| 29 | Mon, Feb 28, 2022 | Syracuse | W | 88 | 79 | OT | 22 | 8 | W 4 |
| 30 | Sat, Mar 5, 2022 | Duke (4) | W | 94 | 81 | NaN | 23 | 8 | W 5 |
| 32 | Thu, Mar 10, 2022 | Virginia | W | 63 | 43 | NaN | 24 | 8 | W 6 |
| 33 | Fri, Mar 11, 2022 | Virginia Tech | L | 59 | 72 | NaN | 24 | 9 | L 1 |
| 35 | Thu, Mar 17, 2022 | Marquette | W | 95 | 63 | NaN | 25 | 9 | W 1 |
| 36 | Sat, Mar 19, 2022 | Baylor (4) | W | 93 | 86 | OT | 26 | 9 | W 2 |
| 37 | Fri, Mar 25, 2022 | UCLA (11) | W | 73 | 66 | NaN | 27 | 9 | W 3 |
| 38 | Sun, Mar 27, 2022 | Saint Peter's | W | 69 | 49 | NaN | 28 | 9 | W 4 |
| 39 | Sat, Apr 2, 2022 | Duke (9) | W | 81 | 77 | NaN | 29 | 9 | W 5 |
| 40 | Mon, Apr 4, 2022 | Kansas (3) | L | 69 | 72 | NaN | 29 | 10 | L 1 |
import matplotlib.pyplot as plt
rankings_transpose.plot(kind='line', marker='o')
# Adding labels and title
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Line Graph of Team Values Over Time')
# Display the plot
plt.legend(bbox_to_anchor=(1.1, 1.05))
plt.show()
fig = px.line(rankings_transpose,
x=rankings_transpose.index,
y="Kansas",
title="Weekly Ranking of Kansas",
labels={"index":"Week of Ranking", "Kansas":"Ranking"})
fig.show()
fig = px.line(rankings_transpose,
x=rankings_transpose.index,
y="UNC",
title="Weekly Ranking of UNC",
labels={"index":"Week of Ranking", "Kansas":"Ranking"})
fig.show()Above, we see what the external data looks like, and we also see how our individual teams fared throughout the season. The weekly rankings show a lot of change, and even when we zoom in on our two teams, there is a lot of movement as well (or, in the case of UNC, they are barely ranked!). This will help inform some of our other analysis as we continue.